🤔 How do we use distributions of parameters (prior or posterior) to make predictions?
Two sources of variability:
Distributional (Prior or Posterior) Variability
Sampling (Data) Variability
How certain are we about the value of the parameters?
How much will observed data typically vary from its expected value? In other words, the inherent variability of the data around the true value.
\[ p(\tilde{x} \mid x) \]
Probability of new sample \(\tilde{x}\) given the current data \(x\).
draw \(\theta_i \sim \pi(\theta)\) (one sample of all the model parameters)
draw \(y_i \sim p(y_i \mid \theta_i)\) (sample of data from model implied by \(\theta_i\))
Step 1 accounts for uncertainty about the parameters \(\theta\); step 2 accounts for sampling variability in the data.
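These two steps can be sketched directly in R. Here is a minimal prior predictive simulation for a toy model where the prior and likelihood are my own illustrative choices (not from the slides):

```r
# toy model: theta ~ Normal(0, 2) (prior), y | theta ~ Normal(theta, 1)
set.seed(1)
n_draws <- 1000

# step 1: draw parameters from the prior
theta <- rnorm(n_draws, mean = 0, sd = 2)

# step 2: draw one observation from the model implied by each theta
y_tilde <- rnorm(n_draws, mean = theta, sd = 1)

# the spread of y_tilde reflects BOTH sources of variability:
# var(y_tilde) is roughly 2^2 + 1^2 = 5, wider than either source alone
var(y_tilde)
```

Swapping the prior for the posterior in step 1 gives the posterior predictive distribution.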
Prior predictive checks allow us to simulate samples from our prior and make sure it lines up with our expectations about reality.
Let’s adjust our priors to be more realistic.
Posterior predictive checks do something similar, but simulate new data from the posterior instead. This lets us “look for systematic discrepancies between real and simulated data” (Gelman et al. 2004), which tells us something about model fit.
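With a fitted brms model (here assumed to be an object called `fit`), posterior predictive checks are one function call via bayesplot:

```r
library(brms)
library(bayesplot)

# overlay densities of replicated datasets on the observed data;
# systematic gaps between y and y_rep suggest model misfit
pp_check(fit, type = "dens_overlay", ndraws = 100)

# compare a summary statistic (here the mean) across replications
pp_check(fit, type = "stat", stat = "mean")
```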
Remember, p-values tell you:
given a distribution \(\pi(\theta)\), what is the probability of observing a \(\theta\) as or more extreme than the one we did? In other words, how consistent is \(\theta\) with \(\pi(\theta)\)?
# generate samples from posterior
yrep <- posterior_predict(m1, draws = 500)
# overall mean function
overall_mu <- function(x) mean(x)
# calculate the statistic for the actual data
overall_mu(df$y)
[1] 0.236782
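To finish the check, apply the same statistic to each replicated dataset and see where the observed value falls. Continuing with the `yrep` matrix above, `ppc_stat()` from bayesplot is one convenient way:

```r
library(bayesplot)

# rows of yrep are posterior draws, columns are observations,
# so apply the statistic across rows
yrep_mu <- apply(yrep, 1, overall_mu)

# posterior predictive "p-value": share of replications at least
# as large as the observed mean
mean(yrep_mu >= overall_mu(df$y))

# or plot the statistic's distribution with the observed value marked
ppc_stat(df$y, yrep, stat = "mean")
```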
mtcars Data
Rows: 32
Columns: 11
$ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
$ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
$ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
$ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
$ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
$ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
Family: gaussian
Links: mu = identity; sigma = identity
Formula: mpg ~ hp + gear
Data: mtcars (Number of observations: 32)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 18.09 3.38 11.37 24.85 1.00 4413 3011
hp -0.06 0.01 -0.08 -0.05 1.00 4294 2990
gear 3.10 0.80 1.50 4.73 1.00 4682 3090
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 3.22 0.44 2.49 4.21 1.00 3242 2756
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
(more on ROPE with a brms model here)
# s/o to Chat GPT for helping simulate the data
# libs
library(brms)
library(bayesplot)
library(tidybayes)
# sim data
set.seed(540)
n <- 200
parental_income <- rnorm(n, mean = 50, sd = 10) # income_in_k
z <- 1 + 0.05 * parental_income + rnorm(n, 0, 1)
p <- 1 / (1 + exp(-z))
passed_exam <- rbinom(n, 1, p)
df <- data.frame(parental_income, passed_exam)
# priors
priors <- c(
prior(normal(0, 5), class = "b"),
prior(normal(0, 5), class = "Intercept")
)
# fit
fit <- brm(passed_exam ~ parental_income, data = df,
family = bernoulli(), prior = priors,
seed = 123, chains = 4, cores = 4, iter = 2000)
# summary
summary(fit)
Family: bernoulli
Links: mu = logit
Formula: passed_exam ~ parental_income
Data: df (Number of observations: 200)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 1.30 1.69 -2.08 4.61 1.00 2293 2106
parental_income 0.04 0.03 -0.03 0.11 1.00 2132 1965
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
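Given the wide interval on parental_income, we might summarize the direction of the effect straight from the draws; `hypothesis()` in brms reports the posterior probability of a one-sided statement:

```r
# posterior probability that the slope is positive
hypothesis(fit, "parental_income > 0")

# or work with the draws directly
draws <- as_draws_df(fit)
mean(draws$b_parental_income > 0)
```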
# s/o to Chat GPT for helping simulate the data
# Load required libraries
library(brms)
library(bayesplot)
library(tidybayes)
# sim
set.seed(123)
n_schools <- 10
n_students <- 20
total_n <- n_schools * n_students
school <- factor(rep(1:n_schools, each = n_students))
parental_income <- rnorm(total_n, mean = 50, sd = 10) # Parental income in thousands of dollars
school_effect <- rnorm(n_schools, 0, 5)[school]
individual_error <- rnorm(total_n, 0, 5)
exam_score <- 50 + 0.5 * parental_income + school_effect + individual_error
df <- data.frame(school, parental_income, exam_score)
# priors
priors <- c(
prior(normal(0, 10), class = "b"),
prior(normal(50, 10), class = "Intercept"),
prior(exponential(1), class = "sd")
)
# fit
fit <- brm(exam_score ~ parental_income + (1 + parental_income | school),
data = df, family = gaussian(), prior = priors,
seed = 123, chains = 4, cores = 4, iter = 2000)
# summary
summary(fit)
Family: gaussian
Links: mu = identity; sigma = identity
Formula: exam_score ~ parental_income + (1 + parental_income | school)
Data: df (Number of observations: 200)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~school (Number of levels: 10)
Estimate Est.Error l-95% CI u-95% CI Rhat
sd(Intercept) 1.03 0.97 0.03 3.66 1.00
sd(parental_income) 0.11 0.03 0.06 0.19 1.00
cor(Intercept,parental_income) -0.02 0.55 -0.94 0.93 1.01
Bulk_ESS Tail_ESS
sd(Intercept) 2631 1886
sd(parental_income) 1048 1628
cor(Intercept,parental_income) 348 1082
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 51.74 1.99 47.84 55.71 1.00 5261 2955
parental_income 0.48 0.05 0.37 0.58 1.00 2121 1788
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 4.94 0.26 4.45 5.47 1.00 4369 2284
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
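The school-level effects behind those hyperparameters can be pulled out with `ranef()` (group-level deviations) or `coef()` (deviations plus the population-level effect):

```r
# deviations of each school's intercept and slope from the population values
ranef(fit)$school

# per-school coefficients (population-level + group-level),
# here just the parental_income slopes
coef(fit)$school[, , "parental_income"]
```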
The null sampling distribution is where we get our p-values and critical values.
The alternative sampling distribution is where we get our confidence intervals.
t-test
z-test
chi-square test
z-statistic:
\[ z = \frac{\bar{x} - \mu_0}{se} \]
where \(\bar{x}\) is the observed sample statistic, \(\mu_0\) is the sample statistic expected under \(H_0\), and \(se\) is the sample standard error.
z-statistics measure how far away the sample statistic is from the expected \(H_0\) statistic. It’s a z-score using the null distribution \(\mathcal{N}(\mu_0, se)\)
❓when sample statistics are compatible with \(H_0\), what type of z-statistics would you expect to see?
Because we divide by \(se\), z-statistics are standardized, meaning that we can use the same standard cutoffs to create Neyman-Pearson rejection regions! z-statistics more extreme than \(z = \pm 1.645\) account for the 10% most extreme values; for the 5% most extreme we use \(z = \pm 1.96\).
In a Fisherian or NHST framework, the z-distribution can also help us calculate p-values. In the example shown, \(p = 0.7\).
Common z-tests:
z-tests for means (known \(\sigma\))
z-tests for proportions (believe it or not, \(\sigma\) also known)
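A z-statistic and its two-sided p-value take only a few lines in R; the numbers here are made up for illustration:

```r
# suppose H0: mu0 = 100, and we observe xbar = 103
# with known sigma = 10 and n = 25
xbar  <- 103
mu0   <- 100
sigma <- 10
n     <- 25

se <- sigma / sqrt(n)      # standard error = 2
z  <- (xbar - mu0) / se    # z = 1.5

# two-sided p-value from the standard normal (about 0.13 here,
# so this sample is quite compatible with H0)
p <- 2 * pnorm(-abs(z))
z; p
```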
t-statistic:
\[ t = \frac{\bar{x} - \mu_0}{se} \]
where \(\bar{x}\) is the observed sample statistic, \(\mu_0\) is the sample statistic expected under \(H_0\), and \(se\) is the standard error.
t-statistics also measure how far away the sample statistic is from the expected \(H_0\) statistic.
❓when sample statistics are compatible with \(H_0\), what type of t-statistics would you expect to see?
Because we divide by \(se\), t-scores are standardized, meaning that we can use the same standard cutoffs (depending on df) to create Neyman-Pearson rejection regions!
In a Fisherian or NHST framework, the t-distribution can also help us calculate p-values.
Common t-tests:
t-tests for mean (unknown \(\sigma\))
t-test for regression coefficients
t-test for difference between means
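The same calculation with an estimated \(\sigma\) uses the t-distribution with \(n - 1\) degrees of freedom. A sketch on simulated data (my own toy example), checked against `t.test()`:

```r
set.seed(7)
x   <- rnorm(30, mean = 5, sd = 2)
mu0 <- 5

se     <- sd(x) / sqrt(length(x))
t_stat <- (mean(x) - mu0) / se
df     <- length(x) - 1

# two-sided p-value from the t-distribution
p <- 2 * pt(-abs(t_stat), df)
t_stat; p

# the built-in test reports the same statistic and p-value
t.test(x, mu = mu0)
```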
# is the mean weight of bumblebees in florida different than the mean bumblebee weight of 175mg?
t.test(data,
alternative = "two.sided",
mu = 175)
One Sample t-test
data: data
t = -0.3004, df = 99, p-value = 0.7645
alternative hypothesis: true mean is not equal to 175
95 percent confidence interval:
172.7395 176.6660
sample estimates:
mean of x
174.7028
library(TOSTER)
tsum_TOST(m1 = mean(data),
mu = 175,
sd1 = sd(data),
n1 = length(data),
low_eqbound = 170,
high_eqbound = 180)
One-sample t-test
The equivalence test was significant, t(99) = 4.753, p = 3.4e-06
The null hypothesis test was non-significant, t(99) = -0.300, p = 7.65e-01
NHST: don't reject null significance hypothesis that the effect is equal to 175
TOST: reject null equivalence hypothesis
TOST Results
t df p.value
t-test -0.3004 99 0.765
TOST Lower 4.7530 99 < 0.001
TOST Upper -5.3538 99 < 0.001
Effect Sizes
Estimate SE C.I. Conf. Level
Raw -0.2972 0.9894 [173.0599, 176.3456] 0.9
Hedges's g 17.5226 1.2431 [15.4236, 19.5332] 0.9
Note: SMD confidence intervals are an approximation. See vignette("SMD_calcs").
# is the difference in mean weights of bumblebees in florida and california different than 0?
t.test(x = data1,
y = data2,
alternative = "two.sided")
Welch Two Sample t-test
data: data1 and data2
t = -4.622, df = 195.18, p-value = 6.898e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.118912 -3.262520
sample estimates:
mean of x mean of y
174.5142 180.2050
city temperature rainfall bee_weight
1 City 1 24.64579 933.8643 9.294917
2 City 1 22.91725 1049.5396 3.650119
# fit model
m1 <- lm(bee_weight ~ rainfall + temperature,
data = simulated_data)
# regression table
summary(m1)
Call:
lm(formula = bee_weight ~ rainfall + temperature, data = simulated_data)
Residuals:
Min 1Q Median 3Q Max
-7.0941 -1.3528 -0.0109 1.3792 8.6508
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0597404 0.1251260 40.44 <2e-16 ***
rainfall -0.0050555 0.0001271 -39.78 <2e-16 ***
temperature 0.2991710 0.0045176 66.22 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.999 on 9997 degrees of freedom
Multiple R-squared: 0.3112, Adjusted R-squared: 0.311
F-statistic: 2258 on 2 and 9997 DF, p-value: < 2.2e-16
university age group gender caffeine_mg gestures
1 University 3 25.11077 Group B Woman 151.92202 43
2 University 3 23.72048 Group B Woman 74.42176 29
# does caffeine intake influence the number of gestures used in conversation?
library(lme4)
# z score
data <- data |> mutate(across(all_of(c("age", "caffeine_mg")),
~ (.-mean(.)) / sd(.)))
m2 <- glmer(gestures ~ age + gender + caffeine_mg + (1 | university),
data = data,
family = poisson)
m2 |> summary()
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: poisson ( log )
Formula: gestures ~ age + gender + caffeine_mg + (1 | university)
Data: data
AIC BIC logLik deviance df.resid
641.5 657.1 -314.7 629.5 94
Scaled residuals:
Min 1Q Median 3Q Max
-2.42780 -0.57690 0.04138 0.61653 2.07566
Random effects:
Groups Name Variance Std.Dev.
university (Intercept) 0.0001464 0.0121
Number of obs: 100, groups: university, 5
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.72267 0.03285 113.340 <2e-16 ***
age -0.02396 0.01694 -1.415 0.1572
genderNon-binary -0.10289 0.04393 -2.342 0.0192 *
genderWoman -0.09640 0.04279 -2.253 0.0243 *
caffeine_mg 0.04686 0.01654 2.834 0.0046 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) age gndrN- gndrWm
age -0.225
gndrNn-bnry -0.761 0.250
genderWoman -0.777 0.237 0.617
caffeine_mg 0.105 0.029 -0.146 -0.119
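Because the Poisson model uses a log link, the fixed effects are easiest to read after exponentiation, as multiplicative rate ratios:

```r
# rate ratios: exp(coef) for caffeine_mg is the multiplicative change
# in expected gesture count per 1-SD increase in caffeine
exp(fixef(m2))

# with approximate 95% Wald intervals for the fixed effects
exp(confint(m2, method = "Wald", parm = "beta_"))
```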
\(\chi^2\)-statistic:
\[ \chi^2 = \sum \frac{(o-e)^2}{e}\\ \underbrace{\chi = \frac{(o-e)}{\sqrt{e}}}_\text{for demo purposes only} \]
where \(o\) is an observed count, \(e\) is the count expected under \(H_0\), and the sum runs over all cells.
\(\chi^2\)-statistics also measure how far away the sample statistic is from the expected \(H_0\) statistic.
❓when sample statistics are compatible with \(H_0\), what type of \(\chi^2\)-statistics would you expect to see?
Common \(\chi^2\) tests:
Homogeneity: do two groups have the same distribution for one variable in the dataset?
Independence: are two variables in a dataset related?
Goodness of Fit: does one variable in the dataset match our expected distribution?
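The statistic is just the per-cell formula summed over cells. A quick sketch using the healthcare-survey GOF example that follows (counts 15/10/40/20/15 against target proportions 20/25/30/15/10%):

```r
observed <- c(15, 10, 40, 20, 15)
probs    <- c(0.20, 0.25, 0.30, 0.15, 0.10)
expected <- sum(observed) * probs

# sum of (o - e)^2 / e over all cells
chi_sq <- sum((observed - expected)^2 / expected)
chi_sq  # 17.75

# p-value with df = cells - 1 = 4
pchisq(chi_sq, df = 4, lower.tail = FALSE)
```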
“I am satisfied with the quality of my healthcare”
| | Strong Agree | Agree | Neither | Disagree | Strong Disagree |
|---|---|---|---|---|---|
| Blue Shield | 15 | 10 | 40 | 20 | 15 |
Test of GOF: does the distribution of responses match our goal distribution of 20/25/30/15/10%?
| | Strong Agree | Agree | Neither | Disagree | Strong Disagree |
|---|---|---|---|---|---|
| Blue Shield | 15 | 10 | 40 | 20 | 15 |
| EXPECTED | Strong Agree | Agree | Neither | Disagree | Strong Disagree |
|---|---|---|---|---|---|
| Blue Shield | 100 * 0.2 | 100 * 0.25 | 100 * 0.3 | 100*0.15 | 100*0.1 |
\(\chi^2\)-statistic: \(\chi^2 = \sum \frac{(o-e)^2}{e}\)
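The output below comes from `chisq.test()` with explicit probabilities; a call like this reproduces it (the variable name is my own):

```r
# observed counts vs. goal distribution of 20/25/30/15/10%
observed <- c(15, 10, 40, 20, 15)
chisq.test(observed, p = c(0.20, 0.25, 0.30, 0.15, 0.10))
```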
Chi-squared test for given probabilities
data: c(15, 10, 40, 20, 15)
X-squared = 17.75, df = 4, p-value = 0.001381
“I am satisfied with the quality of my healthcare”
| | Strong Agree | Agree | Neither | Disagree | Strong Disagree |
|---|---|---|---|---|---|
| Blue Shield | 15 | 10 | 40 | 20 | 15 |
| United Healthcare | 10 | 5 | 20 | 40 | 25 |
Test of Homogeneity: Is the distribution of responses different for the two groups?
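The `summary_data` in the output below is presumably the 2×5 count matrix; a call like this reproduces the test (construction is my own):

```r
# counts by insurer (rows) and response category (columns)
summary_data <- rbind(
  `Blue Shield`       = c(15, 10, 40, 20, 15),
  `United Healthcare` = c(10,  5, 20, 40, 25)
)
chisq.test(summary_data)
```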
Pearson's Chi-squared test
data: summary_data
X-squared = 18.5, df = 4, p-value = 0.0009851
# is the proportion of heads different than 0.5? Do we have a fair coin?
prop.test(mean(dat), # prop heads
length(dat), # num flips
p = 0.5, alternative = "two.sided")
1-sample proportions test with continuity correction
data: mean(dat) out of length(dat), null probability 0.5
X-squared = 95.492, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
4.761309e-05 5.674451e-02
sample estimates:
p
0.0064
# Set seed for reproducibility
set.seed(42)
# Define the expected color distribution (e.g., based on a typical Skittles bag)
expected_proportions <- c(red = 0.2, orange = 0.2, yellow = 0.2, green = 0.2, purple = 0.2)
# Simulate the actual observed counts of Skittles in a bag (e.g., a bag of 100 Skittles)
total_skittles <- 100
observed_counts <- rmultinom(1, total_skittles, prob = expected_proportions)[,1]
# print
print("Observed counts:")
[1] "Observed counts:"
red orange yellow green purple
26 24 15 20 15
# test
expected_counts <- total_skittles * expected_proportions
chisq_test <- chisq.test(x = observed_counts, p = expected_proportions)
print("Chi-square GOF test result:")
[1] "Chi-square GOF test result:"
Chi-squared test for given probabilities
data: observed_counts
X-squared = 5.1, df = 4, p-value = 0.2772
# Set seed for reproducibility
set.seed(123)
# Define the number of students in each group
n_eecs <- 100
n_cads <- 100
# Define the expected distributions of undergraduate majors
prob_eecs <- c(math = 0.3, business = 0.1, humanities = 0.05,
computer_science = 0.4, data_science = 0.15)
prob_cads <- c(math = 0.2, business = 0.25, humanities = 0.1,
computer_science = 0.25, data_science = 0.2)
# Simulate data using multinomial distribution
observed_eecs <- rmultinom(1, n_eecs, prob = prob_eecs)[,1]
observed_cads <- rmultinom(1, n_cads, prob = prob_cads)[,1]
# Combine data into a table for the chi-square test
observed_data <- rbind(EECS = observed_eecs, CADS = observed_cads)
colnames(observed_data) <- c("Math", "Business", "Humanities",
"Computer Science", "Data Science")
# print
print("Observed data for EECS and CADS students:")
[1] "Observed data for EECS and CADS students:"
Math Business Humanities Computer Science Data Science
EECS 30 9 8 33 20
CADS 13 27 15 25 20
[1] "Chi-square test of homogeneity result:"
Pearson's Chi-squared test
data: observed_data
X-squared = 18.955, df = 4, p-value = 0.0008022